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Abstract 

Recent advances in adaptive Markov chain Monte Carlo (AMCMC) include 
the need for regional adaptation in situations when the optimal transition kernel 
is different across different regions of the sample space. Motivated by these find- 
ings, we propose a mixture-based approach to determine the partition needed 
for r egional AMCMC. The mixture model is fitted using an onl ine EM algorithm 
(see Andrieu and Moulines . 2006 : Cappe and Moulines . 20Qg| ) which allows us 



to bypass simultaneously the heavy computational load and to implement the 
regional adaptive algorithm with online recursion (RAPTOR). The method is 
tried on simulated as well as real data examples. 

Keywords: Adaptive MCMC, regional adaptation, online EM, mixture model. 



1 Introduction 



In recent years, the Markov chain Monte Carlo (MCMC) class of computational algo- 
rithms h as been enriched wit h adaptive MCMC (AMCMC). Spurred by the seminal 

(120011 ) an increasing body of literature has been devoted to 



paper of 



Haarioetal 
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the study of AMCMC. It has long been known that the fine tuning of the proposal 
distribution's parameters in a Me t ropolis sampler is c e ntral to the performance o f 



the algorithm. 



Haario et al. 



fl200lh.lHaario et all fl2005h . 



Andrieu and Moulinesl (120061 ). 



Andrieu et al 



Andrieu and Robert 



(2001). 



(120051 ) and lRoberts and Rosenthal! (120071 ) 
have provided the theory needed to prove that it is possible to adapt the parameters 
of the proposal distribution "on the fly", i.e. while running the Markov chain and us- 
ing for tuning the very samples produced by the chain. The AMCMC algorithms may 
be vulnerable to the multimodality of the target distr ibution and more care needs to 



be taken in implementing the AMCMC paradigm. In lCraiu et al.l (120081 ) a few possi- 
ble approaches are discussed, central among which is the regional adaptive algorithm 
(RAPT) designed for Metropolis samplers. However, the premise for RAPT is that 
a partition of the sample space is given and it is approximately correctly specified. 
While sophisticated methods exist to detect the modes of a multimodal distribu- 



tion (see 



Sminchisescu and Triggd . 



2001 



2002 



iNeal 



20011 1 it is not obvious how to 



use such techniques for defining the desired parti tion of the sample space. We fol low 



here the methods of 



Andrieu and Moulinesl (120061 ) and lCappe and Moulinesl ( 120091 ) to 



propose a mixture-based approach for adaptively determining the boundary between 
high probability regions. We approximate the target distribution using a mixture of 
Gaussians whose parameters are used to define the partition. The theoretical chal- 
lenges lie in the fact that the volume of data used for fitting the mixture increases as 
the simulation progresses and the data is not independent sin ce it is made of realiza- 
tions o f a Markov chain. Both challenge s have been tackled by 



Andrieu and Moulines 



Cappe and Moulinesl (120091 ). 



(120061 ) and 

In the n ext section we briefly review the RAPT algorithm and the online EM 
algorithm of ICappe and Moulinesl (120091 ) . In section 3 we describe the methodology 
behind the regional adaptive algorithm with online recursion (RAPTOR). The simu- 
lation studies and real data application are discussed in Sections 4 and 5, respectively. 
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2 Regional Adaptation and Online EM 



2.1 Regional Adaptation (RAPT) 

Regional adaptation is motivated by the fundamental and natural idea that, in many 
situations, the optimal proposal distribution used in a Metropolis sampling algorithm 
may be different in separate regions of the sample space S. For now, assume that we 
are given a partition of the space S made of two regions <Si,c>2. The mixed RAPT 
algorithm for a random walk Metropolis (RWM) sampler uses the following mixture 
clS cL proposal distribution 

2 

Q{x,dy) = (l-[3)J2 1 S i ( x )[ X i ) Qi&dy) + XfQ 2 (x,dy)] + f3Q whole (x,dy), (1) 
i=i 

where Qi is adapted using samples from Si and Q w hoie is adapted using all the samples 
in S. The mixing parameters \\ , i = 1,2 are also adapted while the parameter (3 is 
constant throughout the simulation. Details regarding t he adaptation proc edures for 
the above distributions and parameters can be found in ICraiu et al.l (120081 ) who also 
provide a proof regarding the asymptotic convergence of the algorithm. 

One can see that, regardless of the region the chain is currently in, the proposal 
distribution is a mixture of three distributions: Qi,Q2 which are approximately op- 
timal choices for the target restricted to Si and S 2 , respectively, and Q w hoie, which 
has the purpose of ensuring good traffic between the two regions. The reason we 
use a mixture with these three components (as opposed to using a mixture with the 
components Qi and Qwhoie when the chain is in Si) is intuitively motivated by the 
uncertainty of determining the ideal partition S = S± U £2- The degree of success 
for RAPT depends on whether the partition used is a relatively good approximation 
of the ideal one. In the next section we propose to adaptively modify the partition 
between the two regions using the online EM algorithm. 



2.2 Online EM 

Denote ir the target distribution of interest. Working under the assumption that it 
is multi-modal one can try to approximate tt using a mixture of Gaussian distribu- 
tions. The approximation is in many cases accurate once the distribution it can be 
well approximated by a Gaussian in a neighborhood of each local mode. The analysis 
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of mi xture models h as relied for a while now o n the EM algorithm (IDempster et al 
19771 ) as discussed by lTitterington et al.l (119851 ) and references therein. In the MCMC 
setup the amount of data available to fit the mixture increases as the simulation pro- 
gresses, therefore making unfeasible the traditional implementation of the algorithm. 
An added complication is that the streams of data contain dependent realizations 
as they are produced by running one or more Markov chains. Both difficulties are 
dealt with effectively by lAndrieu and Moulined (120061 ) who propose an online EM 
algorithm that updates the pa rameter estimates become available. The 

algorithm is further refined by ICappe and Moulined (120091 ) . 

The M-step for the classical EM algorithm involves the maximization (in 9) of 



Q*(0i :n ;0) = J>[log/(Xi; 



i=l 



where Y Vn are the n- dimensional observed da t a and Xi is the i-th unit complete data. 



Andrieu and Moulined (120061 ) modify the Q function to 



The online EM of 

Qn + l(0) = Q n {e)+ ln+1 [E L [l0gf(X n+1 . 



\Y n+1 ] - Q n {6) ) (2) 



and set n+ i as its maximizer. Here n is the size of the sample y\._ n available at the 
n-th iteration. Note that the volume of available data increases at each iteration of 



the algorithm while t he weights 7„. are set to decre ase with n. For additional 



we refer the reader to 



Andrieu and Moulined (120061 ) and lCappe and Moulines 



details 



(2009). 



3 Mixture based boundary adaptation 



3.1 An illustrative example 

Consider the curved density of general form (see 



Roberts and Rosenthal 



2006|): 



f(x\ B) oc exp 



-a;?/200 - -(x 2 + Bx\ - 100S) 1 



(xl + x 2 4 + . 



+ 4) 



For illustration, we consider here the 2-dimensional version of 



(3) 

and in section 14.31 



we will perform a simulation study for the 5-dimensional version of ([3]). In Figure 1(a) 



the contour plot for i? = 0.1 is shown. The correlation between the two coordinates is 
close to 0, so a standard adaptive RWM algorithm may use a nearly spherical, largely 
overdispersed Gaussian distribution. 
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(a) Target prob. density (b) Chains scatterplot (c) Final mixture estimate 

Figure 1: Example of regional adaptation applied to a curved target distribution. 



Within the RWM framework we can gain efficiency by splitting the state space 
horizontally into two regions, and adapting the covariance matrices in each region. 
Using the online EM algorithm for the MCMC sampling output, we can fit a mixture 
of distributions to n, and adapt the chain's transition kernel according to the mixture 
parameter values. 

We run 10 parallel chains of our RAPTOR algorithm for 25000 iterations, using 
the first 1000 as burn- in and allowing exchange of information between chains (see 



Craiu et al. 



20081 ). In Figure 1(b) we show the scatterplot of the values obtained using 



all the ten chains. The final Gaussian mixture estimate is plotted in Figure 1(c) Here 
we can see that the final mixture fit mimics well the target density. 

We design RAPTOR so that it exploits the Gaussian mixture approximation and 
increases the sampling efficiency while adding little computational overhead. In the 
next section we discuss how the Gaussian mixture approximation can be used to: i) 
define a convenient partitioning of the state space and ii) tune the proposal distribu- 
tion of the Metropolis sampler within each region. 

3.2 RAPT with online recursion 

Consider the K components mixture model: 

K 

£fl}JV(x;/i*,E*) W 
k=i 

where N(; fi, S) is the probability density of a Gaussian distribution with mean [i and 
covariance matrix S. In standard mixture modelling terminology, fflh is called the 



'incomplete' likelihood, and the complete likelihood is written as follows: 

K 

/,(x,z) = n^(x; /U ;,Sj)] 1{ ^ ) (5) 
fc=i 

where z is an unobserved labelling variable taking values in the finite set {1, 2, ... , K}. 
For notational convenience all the parameters involved in the model are included in 
the vector r\ = {(P*, k = 1, . . . , K}, with r\ G Q. We propose to approximate 

the target distribution it with q v so that the Kullback-Leibler distance between ir and 
q v is minimized. Given the approximation (j3J) to tc we define the region as the set 
in which the k-th component of the mixture density q v dominates the other ones., i.e. 

Sj = {x : arg max fc ,iV(x; /xj', Sj') = fc}. (6) 

The implicit assumption is that, in Sr, tc is well approximated by a Gaussian distri- 
bution with mean jjt and covariance matrix This approximation can be exploited 
in the definition of the local Metropolis proposal distribution. 

Note that the mixture parameters 0* are omitted from the boundary definition ([6]). 
We also do not exclude components with small weights. It should be also noted that 
in the current approach K is fixed and its choice can be based on an exploratory 
numerical analysis of the target tc (e.g., the number of local maxima of tt). 

We expect that the recurrent update of the boundary between regions will eventu- 
ally lead to regions that are optimal or close to optimal. Perhaps more importantly, 
this approach provides a general strategy to tackle the tricky issue of partitioning the 
sample space. Although in principle we could continue to use the proposal distribution 
([1]), a good partition of the sample space allows the use of 

K 

Q v (x, dy) = (1 - a) V) 1 c* (x)N(y; x, e d Sj)d?/, +aN(y; x, e d Y%)dy (7) 
k=l 

where is the marginal variance of q v , = 2.38 2 /<i, a c hoice based on the optimality 



result s obtained for the RWM by iRoberts et al.l (119971 ) and 



Roberts and Rosenthal 



(120011 ). and a G (0, 1) is a fixed weight which controls the flow between regions. 

The transition kernel ([7j) depends on the mixture parameters 1] in two ways: via 
the regions definition (Q and, more directly, via the covariance matrices and £™. 
The adaptation strategy consists in replacing at each iteration, say nth, the parameter 
r\ with an estimate r] n which is obtained from the chain's realizations observed so far. 
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(a) Different means, equal (b) Different means, different (c) Equal means, different vari- 
variances variances ances 

Figure 2: RAP TOR- defined regions for different relative values of the mixture com- 
ponents parameters. Region 1 in dark gray, region 2 in light gray. 

In Figure [2] some actual shapes of the boundary between two regions as specified 
by (J6j) are shown. It can be seen that the boundary has a good level of flexibility, and 
can represent both convex and concave regions. Indeed, regions can also have 'holes' 



as seen in Figure 2(c) 



3.3 The online EM for RAPTOR 

If v\ = P{Zi = k\xi,f]i) then 



V i ~ Qk' xl„...k> s^k' \> 



Efc'A-i<K^;^-i> s ri v 

where 7] n = \i k n , E*), k — 1, . . . , K}. If we define s\ 



4 = \ YU = (i - Vn)4-i + l/n (9) 



then the recursive estimator ^ n = E*) : fc = 1, . . . , X} is 

= s k 



fe _ l/n Y,ti u i x i 

- ~ k , (10) 



v fc _ V 72 J2i=l y] i x i x 'i ,.k,,kl 
u - — I h L nh L n ■ 



The scheme fl8l)-( fT0|) d efines an online EM whose convergence has been proved 



by lAndrieu and Moulined (120061 ). They have shown that, under mild regularity con- 



7 



ditions on 7T, the estimator defined by fl8|)- f|T0|) converges to the val ue of r? which mini- 



Andrieu and Moulines 



mizes the Kullback-Leibler divergence between n and q v . Moreover, 
(120061 ) proved also the ergodicity of an adaptive independent Metropolis sampler 
whose proposal parameters are the estimates produced by the online EM. The de- 
tailed derivation of equations fl8l- (fT0]) i s shown in appendix \X 



We should note that Remark 8 in 



Andrieu and Moulines 



direct extendability of their proof to the curren t RWM setting. There 
replicate the proofs here and refer the reader to 
theoretical groundwork. 



2006) points out the 



'o re, we do not 



Andrieu and Moulinesl (120061 ) for the 



3.3.1 Inter-Chain Adaptation extension 

The recursive estimation scheme defined above can be easily extended t o the context 



Craiu et al 



of mu ltiple parallel chains, allowing inter-chain adaptation (INCA, see 
2008k 

Denote the MN samples obtained from M parallel chains by {{A] 11 }, 1 < m < 
M, 1 < t < N}. For each N, we can build a pooled chain using, for any 

1< k < MN, Y h = Xf®, where z(Jfe) = k-M[j(k)-l] and j{k) = L^f^J • We apply 



the recursive estimation scheme ( TTUT) to the sequence without modifications. 



4 Simulations 

In this section, we illustrate the performance of the RAPTOR algorithm using Gaus- 
sian mixtures under different scenarios designed to cover a wide range of possibilities. 
In addition, we test RAPT OR on an irregular ly sh aped target distribution whic h 



has been already studied in lHaario et al.l (120011 ) and iRoberts and Rosenthal! (120061 ). 
In this scenario, the target probability density has only one mode and the domain 
is well connected, so that there is no real risk for a standard Metropolis algorithm 
of remaining trapped in one region of the state space. However, we will show that 
even in such cases regional adaptation, in particular RAPTOR, improves over the 
non-regional Adaptive Metropolis algorithm. 



S 



4.1 Algorithms comparison 

In the following, we will compare different Metropolis algorithms using the following 
summaries: 

(I) Acceptance Rate (AR), 

(II) Mean Squared Error (MSE) of the sample mean estimator, 

(III) Bias of the sample mean estimator, 

(IV) Distance between the target cumulative distribution function (CDF) and the 
empirical cumulative distribution function (ECDF) . 

We propose to use (IV) as a more comprehensive indicator of the sampling effi- 
ciency, compared to (III) which summarizes only the first two moments of the Monte 
Carlo estimator. Evidently, the main caveat of (IV) is that it cannot be used in real 
applications when the target CDF is not known. 

For numerically evaluating the distance between an EC FD calculated using the 



1993 



Chen et al. 



2000) and the target CDF, we 



MCMC output (see lSen and Singer! , 
introduce the index 

D n = J \F n -F\ 2 dF } (11) 
where F n is the ECDF obtained using {X t }i< t < n , i.e., 

1 n 

F n {z) = -Y j l{X t <z}. (12) 

In cases where it's easy to get i.i.d. samples from F, the integral in (jlip can 
be computed numerically by Monte-Carlo simulation. More precisely, given a set 
{yi, . . . , Dm} of i.i.d. draws from F, we approximate D n using 

1 M 

3=1 



|2 



Note that, in the above formula, the algorithm under evaluation is involved through 
the ECDF F n , while the target CDF is used both in F and in the generation of the 
sample {yj}j- The encompassing nature of the index is obvious, as in practice the 
objective of the MCMC procedure is precisely to get good samples from F. Moreover, 
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the measure D n is appealing because by integrating with respect to F we give more 
weight to regions of the state space with higher probability, and automatically ignore 
discrepancies between F n and F in zones which are of low interest. 

For simplicity, we will use the notation D n even when its Monte-Carlo approxi- 
mation is used instead. In practice, we will report D n , the average of B independent 
replicates of D n , i.e. 

A* = ^X>i 6) - ( 13 ) 

6=1 

4.2 Gaussian mixture target distribution 

In this section the target distribution is a Gaussian mixture 

f{x- e, d, S) = £N(x- -d x 1, Is) + (1 - t)N(x; dxl,Sxl 5 ), (14) 

where (,ci,SeR and N(; fi, S) is the probability density of a 5-dimensional Gaussian 
distribution with mean \x and covariance matrix E. For increasing values of d, the 
target distribution presents two modes which are more and more separated and S is 
the ratio between the marginal variances of the two mixture components. A priori, 
we expect RAPTOR to make a difference when d is at least moderately large. 
We compare 4 different adaptive RWM algorithms: 

• RAPTOR 

• RAPT, with boundary {x\ + x 2 = 0} 

• RAPT, with boundary {x\ + x 2 = 2} (named RAPT2 in the following) 



Adaptive Metropolis (AM) flHaario et all 120011 ) 



We have run 10 chains in parallel with randomized starting values, each for a 
total of 10000 iterations, using the first 5000 as a burn-in, and allowing sharing of 
information between chains (see sec. 13.3. ip . The simulation has been replicated 200 
times. Initial values for local means and covariance matrices have been set as follows: 

fj, = 1.5 x fJ, tme , (3q = 0.5, Sq = 0.5 x S^ ruc , Sq = 5.0 x S^ rue (15) 

For all algorithms, these values have been used for setting the starting proposal 
covariance matrices. For RAPTOR, these have been also used as starting parameters 
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d=3, 


S=l 


d= 


:0, S = 4 


Algorithm 


AR 


MSE 


AR 


MSE xlOO 


RAPTOR 


0.2485 


0.0813 


0.3092 


0.1888 


RAPT 


0.2477 


0.1239 


0.2747 


0.2410 


RAPT2 


0.2430 


0.1309 


0.2687 


0.3346 


AM 


0.0937 


0.1671 


0.2739 


0.5837 



Table 1: Gaussian mixture target distribution: MSE and acceptance rates in two 
different scenarios, £ = 0.5. 

estimates. In each simulation and for each algorithm we report the mean squared error 
(MSE) and the acceptance rates. These were computed based on the 200 replications 
of the simulation. The results are reported in Table [TJ 

For £ = 0.5, d = 3 and S = 1, all the regional adaptive algorithms reach an 
average acceptance rate of around 24% while AM remains below 10%. Also in terms of 
MSE, all the regional adaptive algorithms outperform the simple adaptive Metropolis. 
However, here we see that RAPT with the boundary {x\ + X2 = 0} has a slightly 
smaller MSE than RAPT2 which uses the boundary {x\+X2 = 2}, and that RAPTOR 
performs better than both, lowering again MSE by more than 30%. 

Encouraging results have been obtained also for the scenario with two identi- 
cal target mixture means, same weights, but different variances. Here the optimal 



mixture-based boundary has the shape showed in Figure 2(c) , so that local RAPTOR 
proposals have smaller steps in the center of the distribution and bigger steps in the 
tails. The global proposal induces jumps with a length between the lengths produced 
by the two local proposals. In this scenario, the boundary produced using RAPTOR 
differs drammatically from that of RAPT and RAPT2, and this yields an efficiency 
gain resulting in a 21% decrease in the MSE of the sample mean estimator and a 
12.5% improvement of the average acceptance rate over RAPT. 



4.3 A curved target distribution 

We consider the probability density given in equation ([3]) in the case of 5 dimensions. 
We run each chain 500 times, with starting conditions randomly drawn from a uniform 
distribution on the hypercube (—2; 2) 5 . For all methods, we used the first 4000 
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RAPTOR RAPT RAPT2 AM 

11.23 11.22 8.81 4.84 

Table 2: Curved target distribution simulation: acceptance rates (%) averaged over 
500 independent runs of each chain 

X\ X2 A3 

RAPTOR 4.1229 8.7565 0.0099 

RAPT 4.6342 8.2658 0.0098 

RAPT2 4.0647 17.3305 0.0146 

AM 4.1506 11.7487 0.0235 



Table 3: Curved target distribution simulation: MSE of the estimator of the mean of 
the first 3 coordinates, for each algorithm. Estimates are based on 500 independent 
chains replications. 

iterations as a burn-in. We compare again the same four different algorithms: 

• RAPTOR 

• RAPT, with boundary fixed to {xi 

• RAPT, with boundary fixed to {x 2 

• AM 

For all the algorithms, starting parameters values were determined on the basis 
of a preliminary simulation stage, common to the 500 replications. We have run 
4000 iterations of a Gaussian Metropolis Random Walk to get initial estimates of 
the target distribution covariance matrix, as well as initial estimates for a Gaussian 
mixture approximation, obtained by running a classical EM algorithm. The weight 
a have been set to 0.2 in RAPTOR as well as in both RAPT implementations. 

In table [2] we report the acceptance rates for the four different sampling strategies. 
We see that all the three regional adaptive methods outperform the simple Adaptive 
Metropolis. The best performance is achieved by RAPTOR and the RAPT with the 
vertical boundary {x\ = 0}. Indeed, we will see that RAPTOR tends to approximate 
the RAPT boundary very well. 



= 0} 

= — 1} (named RAPT2 in the following) 
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Figure 3: Log-average distance from the curved target distribution 





Region 1 






Region 


2 




Algorithm 


p\ 


Y 1 


Pl,2 


* 2 
Pi 


Y 2 
^1,1 




f>\,2 


RAPTOR 


-7.103 


25.421 


0.944 


7.619 


24.593 




-0.952 


RAPT 


-7.159 


24.834 


0.951 


7.293 


25.304 




-0.951 



Table 4: Curved target distribution: average local estimates 
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Figure 4: Curved target distribution: average marginal variance estimates 
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The MSE for the mean estimates shows that all the tested algorithms are roughly 
equivalent in the first coordinate, while RAPTOR and RAPT achieve the best perfor- 
mances on the second coordinate, with RAPT having the best score, and RAPTOR 
closely following. The results also emphasize the importance of defining the regions 
with relative accuracy as RAPT2 is less efficient than AM. 

In Figure [3] we plot the D n index averaged over the 500 chains replicates. One 
can see that RAPT output, on average, approximates the target CDF better than the 
other 3 algorithms, with RAPTOR following very closely. The conclusions are similar 
to those based on MSE as AM and RAPT2 provide less accurate approximations than 
RAPT and RAPTOR. 

In all the 4 algorithms, the proposal distribution uses the estimates of the mixture's 
components variances. In Figure H] we show the average trend of these estimates for 
the first coordinate of the state space. Here we see that RAPTOR, RAPT and AM 
rapidly converge towards similar values, while RAPT2 gets stuck on slightly smaller 
values. 

In almost all diagnostics (acceptance rates, MSE, D n index), RAPTOR showed 
performances very similar to those of RAPT. Indeed, the estimates of the region 
specific means and variances resulted to be very similar in the two algorithms. In 
Table H] we report the average final estimates of the mean and variances of the first 
coordinate in the two regions, as well as the average estimated correlation between 
the first and the second coordinate. The value of the estimates shown in Table H] help 
us determine that the partition selected by RAPTOR is the same optimal partition 
we have explicitely chosen for RAPT. 

5 Real Data Example: Genetic Instability of Eso- 
phageal Cancers 

We analyzed the "Loss of H eterozygosity" (LOH ) dataset from the S eattle Barrett's 



Esop hagus resea r ch pr oject (IBarrett et all Il996l ). already analyzed in IWarned (120011 ) 



and 



Craiu et al. 



(120081 ). We refer to these papers and references therein for a detailed 
description of the data. The dataset is composed by 40 measures of frequencies of 
the event of interest (LOH) with their associated sample sizes. The model adopted 
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for those frequencies is a mixture model, as indicated by iDesail (120001 ): 

Xi ~ r]Binomial(A^, 7i"i) + (1 — r])Beta-Binomial(A^, 7r 2 , 7) (16) 
with priors: 

■q ~ Unif[0,l], 
7Ti ~ Unif[0,l], 

(17) 

7T 2 ~ Unif[0,l], 
7 ~ Unif[-30,30], 

where r\ is the probability of a location being a member of the binomial group, -K\ is 
the probability of LOH in the binomial group, 7r 2 is the probability of LOH in the 
beta-binomial group, and 7 controls the variability of the beta-binomial group. The 
parametrization adopted for the Beta-Binomial distribution is such that 7's range is 
the real line. As 7 — > — 00 the beta-binomial becomes a binomial and as 7 — > 00 the 
beta-binomial becomes a uniform distribution on [0, 1]. In order to facilitate the use 
of the RWM we have used the logistic transformation on the parameters rj, 717, 7r 2 . 

We run 10 parallel chains of the RAPTOR algorithm, allowing exchange of in- 
formation between chains using INCA. The starting points for these chains were 
drawn from a quasi-random distribution uniformly covering the hypercube [0.1, 0.9] 3 x 
[—20, 20]. All the chains were run for 200000 iterations, using the first 10000 as burn- 
in. The factor a which controls the relative importance of the global vs. the local 
proposal jumps has been set to 0.7. In our experiments, the RAPTOR chains dis- 
played good performances even for smaller burn-in lengths and different values of 
a. However, setting a relatively big value of the burn-in guarantees a less erratic 
behaviour of the chain between simulation replications, while a relatively big value 
of a ensures a faster learning of the relative importance of the two target mixture 
components. 

In Figure [6] the traces of the coordinate 717 of the 10 parallel chains are reported. 
Here one can see that all the chains switch very often back and forth between the two 
posterior modes. 

In Figure [5] we show the marginal scatterplot of (7ri,7r 2 ) for all the samples ob- 
tained using the 10 parallel chains. In this plot the differences between the mixture 
components of the target distribution are clear. In a situation like this, one single 
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setup for a RWM proposal distribution over the whole state space would be highly 
inefficient, while a regional Adaptive Metropolis would use different parameters val- 
ues in each of the two regions. Moreover, by using RAPTOR, the regions can be 
identified automatically, without additional input. In the following, we will label as 
region 1 the region with lower tti mean value, and as region 2 the region with bigger 
7Ti mean value. 

It is difficult to visualize the partition produced by RAPTOR in the four-dimensional 
space so instead we choose to show slices of the partition. In general, if the partition 
is defined according to ([6]) then for a fixed subset / of the coordinates of interest 
and after fixing xi = (xj : j G I) at say, xj we can consider the slice through S k 
determined by xi as 

<S fc (a;/) = {x/ c : arg max fc ,A^(x = (ij, x IC ) = k}, (18) 

where I c is the complement of set I. We can also define S k c the projection of S k on 
the x/c-coordinate space and then 



S% = {JS k ( Xl ) 



where the union is taken over all the possible values of Xj. One must choose which 
slices are more informative to look at and in general we choose x to correspond to 
the local modes of tt. In Figure [7] bi-dimensional slices of the RAPTOR regions are 



plotted, for values for r\ and 7 equal to their means in region 1 (Figure 7(a)), region 



2 (Figure 7(b) ) and in the whole state space (Figure 7(c) ). We can see that region 2 
is generally smaller than region 1, and that it gets a bigger area for values of 77 and 7 
around their mean in that same region. In general, it divides well the two posterios 
probability masses, allowing for an effective application of the Regional Adaptive 
Metropolis scheme as implemented in RAPTOR. 

In table we summarize final RAPT OR estimates on th e original scales, and 



Craiu et al. 



(2008). In this table we can 



compare them with the results reported in 
see that the results are quite similar, despite the different definitions of the boundary 
between the two regions. This is probably due to the fact that the two posterior modes 
are separated by a relatively large region of low probability, so that a certain degree of 
variability in the boundary specification is allowed, without affecting the results too 
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■It, 



Figure 5: LOH data simulation: marginal scatterplot 0/(711,^2). 
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Figure 6: LOH data: parallel traces of ir 1 . Dotted horizontal line separates the two 
modes. 
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(a) (b) (c) 

Figure 7: LOH data simulation: slices of final RAPTOR boundaries estimates for 
values of rj and^j equal to their mean in region 1 (left), region 2 (center), whole state 
space (right). Horizontal axis: 7i"i; vertical axis: 7r 2 . Dark gray: region 1; light gray: 
region 2. 





S 1 


S 2 


whole space 




S 1 


S 2 


whole space 


V 


0.917 


0.042 


0.840 


V 


0.897 


0.079 


0.838 




0.227 


0.949 


0.276 


7Tl 


0.229 


0.863 


0.275 




0.768 


0.238 


0.690 




0.714 


0.237 


0.679 


7 


12.187 


-13.249 


10.336 


7 


15.661 


-14.796 


13.435 



Table 5: Simulation results for LOH data. Region specific and global parameters 
means for RAPTOR (left) and RAPT (right). 

much. However, it must be noted here again that RAPTOR carries the advantage 
that the boundary has been learned automatically, with no prior information input. 



6 Conclusions 



We propose a mixture-based approach for regional adaptatio n of the random walk 



Andrieu and Moulines 



Metro polis algorithm. Using the theoretical foundations laid by 
( 120061 ) we use the online EM algorithm to adapt the parameters of a Gaussian mix- 
tures using the stream of data produced by the MCMC algorithms. In turn, the 
mixture approxim ation is used within the regional adaptation paradigm defined by 



Craiu et al. 



(120081 ). The main purpose of the current work is to provide a general 



method for defining a relatively accurate partition of the sample space. Our simula- 
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tions suggest that the approach produces partitions that are very close to the optimal 
one. 



A The Online EM algorithm 

In the following, we will show how the Online EM algorithm presented in 



Andrieu and Moulines 



( 120061 ) applies to the RAPTOR implementation presented in se ction [3T3l 



Consider the following exponential family mixture model (cfr. 



Andrieu and Moulines 



(|2006j), pag. 1488): 

f v (x,z) =exp{-tfj(r]) + (T(x,z),(j)(r])}} (77, x, z) 6 SI x X x Z (A-l) 

where the density f v is defined w.r.t. some convenient measure on SI x X x Z. T(x, z) 
is the complete data sufficient statistic, i.e. the likelihood f v is a function of the 
complete data pair (x,z) only through the statistic T(x,z). Denote with q v (x) the 
marginal density of f v : 

Qr,( x ) = J f v {x,z)iJ,(dz) (A-2) 
One special case of (I A- II) is the finite mixture of Gaussians: 

K 

f v (x,z) = n l^N{x-^)] 1(z=k) (A-3) 
fc=i 

It can be verified that ( 1A-3I) is indeed a special case of ( 1A-1I) where T(x, z) takes 
the form: 

T(x,z) = {l{z = k} ■ (1, x, xx T ), k=l,...,K} (A-4) 

In the above, x denotes a column vector of dimension d and xx T is the usual matrix 
product. The marginal density q v (x) takes the well known form: 

K 

q v (x) = J2^N(x;^) (A-5) 
The central point in the classical EM algorithm is estimating the expected value 



of the complete log-likelihood flA-lj) conditional on X and a value rj' of 77. Thus we 
need to estimate 

E{log(f n (X,Z))\X,r)'} (A-6) 
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Since the complete log-likelihood is linear in T(X, Z) this reduces to computing the 
conditional expected value of the sufficient statistic. We start by deriving the condi- 
tional distribution of Z given X and rj: 

fv(x,z) 



u v {x,z) :-- 



q v (x) 
/3*JV(x;^,E*) 



Now we can define: 

u v T(x) := J T(x, z)u v (x, z)fx(dz) 

(A-8) 

= ^(x, k) T(x, k) 
k=i 

which is the expected value of the complete data sufficient statistic given X and 77. 
This can be estimated recursively by the following stochastic approximation scheme: 

9 n+1 = (1 - a n+ i)O n + a n+1 u Vn T(X n+1 ) (A-9) 

where 9 n G 6 = T{X, Z). The Maximization Step is the same as that in the classical 
EM setup 

PL = 0°n k 
nk _ e 

^Vn ~ ~^k (A- 10) 

o2,k 

yk n _ k kl 

where we have expressed 9 as: 

= {(9°' k , 9^ k , 9 2 > k ), k = 1, . . . , K} (A-ll) 

with 9°< k e TZ, 9 l > k e 1Z d , 9 2 ' k e 1Z dxd . Different choices are possible for the learning 
weights a n . One possibility which guarantees convergence is a n = n -1 , and this is 
indeed what is used in the RAPTOR implementation. 
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